Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches

نویسندگان

Martha Larson

Daniel Willett

Joachim Köhler

Gerhard Rigoll

چکیده

This paper proposes a novel combined compound splitting and phrase recombination method that optimizes the composition of the speech recognition lexicon for a given domain. Data-driven compound word splitting is followed by iterative recombination of high frequency combinations. Language model perplexity and size are the criteria used to identify a balance between compound decomposition, which reduces OOV, and lexical unit recombination, which packs additional context into a fixed-size vocabulary. The method provides a basis for lexicon design for a LVCSR system on the domain of German parliamentary speeches that is to be used as the foundation of a spoken document information retrieval system. The approach achieves a 35% reduction in OOV without a prohibitively large sacrifice in recognition performance.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Methods and Tools for Speech Data Acquisition exploiting a Database of German Parliamentary Speeches and Transcripts from the Internet

This paper describes methods that exploit stenographic transcripts of the German parliament to improve the acoustic models of a speech recognition system for this domain. The stenographic transcripts and the speech data are available on the Internet. Using data from the Internet makes it possible to avoid the costly process of the collection and annotation of a huge amount of data. The automati...

متن کامل

Open Domain Speech Recognition & Translation: Lectures and Speeches

For years speech translation has focused on the recognition and translation of discourses in limited domains, such as hotel reservations or scheduling tasks. Only recently research projects have been started to tackle the problem of open domain speech recognition and translation of complex tasks such as lectures and speeches. In this paper we present the on-going work at our laboratory in open ...

متن کامل

Constrained Subword Units for Speaker Recognition

Phonetic features have been proposed to overcome performance degradation in spectral speaker recognition in difficult acoustic conditions. The harmful effect of those conditions, however, is not restricted to spectral systems but also affects the performance of the open-loop phone recognisers on which phonetic systems are based. In automatic speech recognition, larger subword units and the use ...

متن کامل

Feature Exploration for Authorship Attribution of Lithuanian Parliamentary Speeches

This paper reports the first authorship attribution results based on the automatic computational methods for the Lithuanian language. Using supervised machine learning techniques we experimentally investigated the influence of different feature types (lexical, character, and syntactic) focusing on a few authors within three datasets, containing transcripts of the parliamentary speeches and deba...

متن کامل

The 2006 RWTH parliamentary speeches transcription system

In this work, investigations in the course of the developement of RWTH automatic speech recognition systems developed for the second TC-STAR evaluation campaign 2006 are presented. The systems were designed to transcribe parliamentary speeches taken from the European Parliament Plenary Sessions (EPPS) in European English and Spanish, as well as speeches from the Spanish Parliament. The RWTH sys...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2000

Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches

نویسندگان

چکیده

منابع مشابه

Methods and Tools for Speech Data Acquisition exploiting a Database of German Parliamentary Speeches and Transcripts from the Internet

Open Domain Speech Recognition & Translation: Lectures and Speeches

Constrained Subword Units for Speaker Recognition

Feature Exploration for Authorship Attribution of Lithuanian Parliamentary Speeches

The 2006 RWTH parliamentary speeches transcription system

عنوان ژورنال:

اشتراک گذاری